Abstract: In this paper we extend graph convolutional neural networks
(GCNNs), one of the existing state-of-the-art deep learning methods,
using the notion of capsule networks for graph classification. Through
experiments, we show that extending GCNNs with capsule networks can
significantly overcome the challenges GCNNs face in the task of graph
classification.
Keywords: Capsule Network, Graph Convolutional Neural Networks.

I.  INTRODUCTION 

Many real-world problems are represented as graphs, such as social
networks, molecular graph structures, and biological protein-protein
networks. All of these domains and many more can be readily modeled as
graphs, which capture interactions (i.e., edges) between individual
units (i.e., nodes).

In many of these problems, the input data is in the form of graphs, and
the task of a graph convolutional neural network is node classification
or graph classification. Graph classification, the problem of
identifying the class labels of graphs in a dataset, is an important
problem with practical applications in a diverse set of fields. Data
from bioinformatics [1], chemoinformatics [2], social network analysis
[3], urban computing [4], and cyber-security [5] can all be naturally
represented as labeled graphs.

The standard graph convolutional neural network model commonly used in
existing deep learning approaches on graphs faces some limitations,
especially when it is applied to the graph classification problem:

•  Loss of information due to the basic graph convolution operation.

•  Graph convolutional neural network models are permutation
equivariant rather than invariant. Because of this they cannot be
applied directly to the graph classification problem, since they
cannot guarantee that the outputs of any two isomorphic graphs are
always the same.

•  Graph convolutional neural network models are limited in
exploiting global information for the purpose of graph
classification.

II.  RELATED  WORK 

Many different techniques have been proposed to solve the graph
classification problem. One popular approach is to use a graph kernel
to measure similarity between different graphs [6]. This similarity can
be measured by considering various structural properties, such as the
shortest paths between nodes [7], the occurrence of certain graphlets
or subgraphs [8], and even the structure of the graph at different
scales [9].

Recently, several new methods which generalize over previous
approaches have been introduced. These methods use a deep learning
framework to learn data-driven representations of graphs [10, 11, 12].
In [10], a method is introduced that generalizes the Weisfeiler-Lehman
(WL) algorithm by learning to encode only relevant features from a
node's neighborhood during each iteration. Interestingly, [11] proposes
a method that processes a section of the input graph using a
convolutional neural network. However, for this to work for graphs of
arbitrary sizes, the method relies on a labeling step that ranks all
the nodes in the graph, which means it still processes the entire graph
initially.

III.  PROPOSED  MODEL  AND  CONTRIBUTION 

The main contributions of our paper can be summarized as follows:

1.  Proposing a novel graph convolutional neural network with capsule
networks (GCNN-CapsNet) model, based on the capsule idea of capturing
highly informative output in a small vector instead of the scalar
output currently used in GCNN models.

2.  Replacing max pooling and node aggregation, the methods currently
used to achieve graph permutation invariance, with a novel graph
permutation invariant layer based on computing the covariance of the
data, in order to solve the graph classification problem.

3.  Designing GCNN-CapsNet so that it exploits the global graph
structure features at each graph node.

3.1  Proposed  model 

The core idea behind using capsule networks in our proposed model is
to capture more information in a local node pool beyond what is
captured by max pooling and by aggregation, the graph convolution
operation used in a standard GCNN model. The new information is
encapsulated in so-called instantiation parameters, described in [13],
which form a capsule vector of highly informative output.

3.2  GCNN  general  model 

For the basic notation, consider a graph $G = (V, E, A)$ of size
$N = |V|$, where $V$ is the vertex set, $E$ the edge set and
$A = [a_{ij}]$ the weighted adjacency matrix. The standard graph
Laplacian is defined by $L = D - A \in \mathbb{R}^{N \times N}$, where
$D$ is the degree matrix. Let $X \in \mathbb{R}^{N \times d}$ be the
node feature matrix, where $d$ is the input dimension. Before
describing our GCNN-CapsNet model, we start by describing a general
GCNN model. Let $G$ be a graph with graph Laplacian $L$ and let
$X \in \mathbb{R}^{N \times d}$ be a node feature matrix. Then the
general form of a GCNN layer output function $f(X, L)$ is given by

$$f(X, L) = \sigma\Big(\sum_{k=0}^{K} L^{k} X W_{k}\Big) \qquad (1)$$

where $\sigma$ is a nonlinear activation function, $L^{k} X$ is the
graph convolution filter of polynomial form with degree $K$, and
$W_{k}$ are the learnable weight parameters.
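The layer in equation (1) can be illustrated with a minimal NumPy sketch. The function name `gcnn_layer`, the choice of `tanh` as the nonlinearity, the toy path graph and the random weights are all assumptions made for illustration; the paper does not specify them.

```python
import numpy as np

def gcnn_layer(X, L, weights, activation=np.tanh):
    """One polynomial graph-convolution layer: activation(sum_k L^k X W_k).

    X: (N, d) node features, L: (N, N) graph Laplacian,
    weights: list of K+1 arrays of shape (d, h), one per power of L.
    The tanh default is an assumption, not taken from the paper.
    """
    out = np.zeros((X.shape[0], weights[0].shape[1]))
    LkX = X.copy()                 # L^0 X
    for W_k in weights:
        out += LkX @ W_k           # add the k-th term L^k X W_k
        LkX = L @ LkX              # advance to L^(k+1) X
    return activation(out)

# Toy 3-node path graph 0 - 1 - 2 (illustrative data only).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = np.diag(A.sum(axis=1)) - A     # L = D - A
X = np.array([[1.], [2.], [3.]])   # d = 1 feature per node
rng = np.random.default_rng(0)
weights = [rng.normal(size=(1, 4)) for _ in range(3)]   # K = 2, h = 4
H = gcnn_layer(X, L, weights)      # (N, h) layer output
```

Accumulating `L @ LkX` step by step avoids forming the matrix powers of the Laplacian explicitly, which is the usual way such polynomial filters are evaluated.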

3.3  Capsule  graph  function 

The capsule graph function is described by considering the $i$-th node
with value $x_0$ and the set of its neighborhood node values
$N(i) = \{x_0, x_1, x_2, \ldots, x_k\}$, including itself. In the
standard graph convolution operation, the output is a scalar function
$f : \mathbb{R}^{k} \to \mathbb{R}$ which takes the $k$ input neighbors
at the $i$-th node and yields an output given by

$$f_i(x_0, x_1, x_2, \ldots, x_k) = \frac{1}{|N(i)|} \sum_{k \in N(i)} a_{ik} x_k \qquad (2)$$

where $a_{ik}$ represents the edge weight between nodes $i$ and $k$.

In our capsule graph network, we replace $f(x_0, x_1, \ldots, x_k)$
with a vector-valued capsule function
$f : \mathbb{R}^{k} \to \mathbb{R}^{p}$. For example, consider a
capsule function that captures higher-order statistical moments as
follows (we omit the mean and standard deviation for simplicity):

$$f_i(x_0, x_1, \ldots, x_k) = \frac{1}{|N(i)|}
\begin{bmatrix}
\sum_{k \in N(i)} a_{ik} x_k \\
\sum_{k \in N(i)} a_{ik} x_k^{2} \\
\vdots \\
\sum_{k \in N(i)} a_{ik} x_k^{p}
\end{bmatrix} \qquad (3)$$
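As a rough illustration of equations (2) and (3), the sketch below evaluates the moment-based capsule function for every node at once. The name `capsule_moments` and the toy graph are hypothetical, and the adjacency matrix is assumed to carry self-loops so that each neighborhood includes the node itself.

```python
import numpy as np

def capsule_moments(A, x, p):
    """Return an (N, p) array whose row i is
    (1/|N(i)|) * [sum_k a_ik x_k, sum_k a_ik x_k^2, ..., sum_k a_ik x_k^p].

    A: (N, N) weighted adjacency with self-loops (assumption), so that
       N(i) includes node i itself.
    x: (N,) scalar node values.
    """
    neighbourhood_size = (A != 0).sum(axis=1)                      # |N(i)|
    powers = np.stack([x ** m for m in range(1, p + 1)], axis=1)   # (N, p)
    return (A @ powers) / neighbourhood_size[:, None]

# Path graph 0 - 1 - 2 with self-loops and unit edge weights.
A = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])
x = np.array([1., 2., 3.])
caps = capsule_moments(A, x, p=2)
# Row 0 is [(1 + 2) / 2, (1 + 4) / 2] = [1.5, 2.5].
```

The first column of `caps` is the scalar output of the standard graph convolution in equation (2); the remaining columns are the extra instantiation parameters the capsule adds.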

3.4  Graph Capsule Vector Dimension

The first layer of the graph capsule network receives an input
$X \in \mathbb{R}^{N \times d}$ and produces a nonlinear output
$f^{(0)}(X, L) \in \mathbb{R}^{N \times h_0 \times p}$. Since the graph
capsule function produces a vector of dimension $p$, the feature
dimension of the output in subsequent layers can quickly blow up to an
unmanageable value. To keep it in check, we restrict the output
$f^{(l)}(X, L)$ to always lie in
$\mathbb{R}^{N \times h_l \times p}$ at any middle layer $l$ of
GCNN-CapsNet, where $h_l$ denotes the number of hidden units of that
layer. This is accomplished by flattening the last two dimensions of
$f^{(l-1)}(X, L)$ and carrying out the graph convolution in the usual
way (see equation (4) for an example of flattening).


3.5  Graph Capsule Function with Statistical Moments

We consider higher-order statistical moments as instantiation
parameters because they are permutation invariant and can be computed
quickly through matrix-multiplication operations. To do this, let
$f_p^{(l)}(X, L)$ be the output matrix corresponding to the $p$-th
dimension. Then we can compute $f_p^{(l)}(X, L)$, containing
statistical moments as instantiation parameters, as follows:

$$f_p^{(l)}(X, L) = \sigma\Big(\sum_{k=0}^{K} L^{k}
\big(\underbrace{f^{(l-1)}(X, L) \odot \cdots \odot f^{(l-1)}(X, L)}_{p\ \text{times}}\big)
W_{pk}^{(l)}\Big) \qquad (4)$$

where $\sigma$ is the nonlinear activation and $\odot$ is the Hadamard
product. Here, to keep the feature dimension from growing, we flatten
the last two dimensions of the input so that
$f^{(l-1)}(X, L) \in \mathbb{R}^{N \times h_{l-1} p}$, and perform the
usual graph convolution operation followed by a linear transformation
with $W_{pk}^{(l)} \in \mathbb{R}^{h_{l-1} p \times h_l}$ as the
learnable weight parameter ($h_l$ being the number of hidden units at
layer $l$). Note that $p$ is used to denote both the capsule dimension
and the order of the statistical moments.
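One possible reading of equation (4) is sketched below in NumPy: the previous output is flattened, raised elementwise to the power p (a p-fold Hadamard product), convolved with the polynomial filter, and mapped by a per-moment weight. The function name `capsule_moment_layer`, the weight layout `(P, K+1, h_in*P, h_out)` and the `tanh` activation are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def capsule_moment_layer(F_prev, L, W, activation=np.tanh):
    """F_prev: (N, h_in, P) previous capsule output.
    L: (N, N) graph Laplacian.
    W: (P, K+1, h_in * P, h_out) weights, one matrix per moment p and power k
       (hypothetical layout).
    Returns an (N, h_out, P) tensor, one slice per moment order p.
    """
    N = F_prev.shape[0]
    P, num_powers, _, h_out = W.shape
    F_flat = F_prev.reshape(N, -1)            # flatten the last two dims
    out = np.empty((N, h_out, P))
    for p in range(1, P + 1):
        LkM = F_flat ** p                     # p-fold Hadamard product
        acc = np.zeros((N, h_out))
        for k in range(num_powers):
            acc += LkM @ W[p - 1, k]          # term L^k (F ⊙ ... ⊙ F) W_pk
            LkM = L @ LkM
        out[:, :, p - 1] = activation(acc)
    return out

rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = np.diag(A.sum(axis=1)) - A
F_prev = rng.normal(size=(3, 2, 2))           # N = 3, h_in = 2, P = 2
W = rng.normal(size=(2, 3, 4, 5)) * 0.1       # K = 2, h_out = 5
F_next = capsule_moment_layer(F_prev, L, W)   # shape (3, 5, 2)
```

Because the input is flattened before each layer, the output stays at N x h_out x P regardless of depth, which is the dimension control described in Section 3.4.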


3.6  Graph Permutation Invariant Layer

The permutation invariant layer of the GCNN-CapsNet model computes the
covariance of the layer output $f^{(l)}(X, L)$ as follows:

$$C\big(f^{(l)}(X, L)\big) = \frac{1}{N}
\big(f^{(l)}(X, L) - \mu\big)^{T} \big(f^{(l)}(X, L) - \mu\big) \qquad (5)$$

Here $\mu$ is the mean of the output $f^{(l)}(X, L)$ and $C(\cdot)$ is
the covariance function. Since the covariance function is
differentiable and does not depend on the order of the row elements,
it can serve as a permutation invariant layer in the GCNN-CapsNet
model. It is also fast to compute, requiring only a single matrix
multiplication. Here we flatten the last two dimensions of the
GCNN-CapsNet layer output
$f^{(l)}(X, L) \in \mathbb{R}^{N \times h_l \times p}$ in order to
compute the covariance.
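The invariance claimed for equation (5) can be checked numerically with a short NumPy sketch. The function name `covariance_layer` and the random test tensor are illustrative assumptions.

```python
import numpy as np

def covariance_layer(F):
    """F: (N, h, p) capsule output. Returns the (h*p, h*p) covariance
    of the flattened node features, taken over the N nodes (rows)."""
    N = F.shape[0]
    F_flat = F.reshape(N, -1)                  # flatten the last two dims
    centered = F_flat - F_flat.mean(axis=0)    # subtract the mean mu
    return centered.T @ centered / N           # single matrix product

rng = np.random.default_rng(0)
F = rng.normal(size=(5, 3, 2))                 # N = 5 nodes, h = 3, p = 2
perm = rng.permutation(5)                      # an arbitrary node relabeling
C = covariance_layer(F)
C_perm = covariance_layer(F[perm])             # same graph, nodes reordered
```

Reordering the rows only reorders the terms of the sums inside the matrix product, so `C` and `C_perm` agree up to floating-point rounding.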

In addition, covariance provides much richer information about the
data by including shape, norm and angle information (between node
hidden features), rather than just providing the mean of the data. In
fact, in a multivariate normal distribution it is used as a
statistical parameter to approximate the normal density and thus also
reflects information about the data distribution. This particular
property, along with invariance, has been exploited before in [14] for
computing the similarity between two sets of vectors. One can also
think about fitting a multivariate normal distribution on
$f^{(l)}(X, L)$, but this involves computing the inverse of the
covariance matrix, which is computationally expensive.

Since each element of the covariance matrix is invariant to the node
order, we can flatten the symmetric covariance matrix
$C \in \mathbb{R}^{h_l p \times h_l p}$ to construct the graph
invariant feature vector
$f \in \mathbb{R}^{(h_l p + 1) h_l p / 2}$. On another positive note,
the output dimension of $f$ does not depend on the number of nodes $N$
and can be adjusted according to computational constraints.
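A minimal sketch of the flattening step follows, assuming that only the upper triangle of the symmetric covariance matrix is kept, which is consistent with a feature vector of length m(m + 1)/2 for m flattened feature dimensions. The function name and the random inputs are hypothetical.

```python
import numpy as np

def invariant_feature_vector(F_flat):
    """F_flat: (N, m) flattened node features with m = h * p.
    Returns the upper triangle of the covariance as a vector of
    length m * (m + 1) / 2, independent of the number of nodes N."""
    N = F_flat.shape[0]
    centered = F_flat - F_flat.mean(axis=0)
    C = centered.T @ centered / N
    upper = np.triu_indices(C.shape[0])        # index pairs with row <= col
    return C[upper]

rng = np.random.default_rng(1)
small_graph = rng.normal(size=(5, 4))          # N = 5 nodes, m = 4
large_graph = rng.normal(size=(9, 4))          # N = 9 nodes, same m
f1 = invariant_feature_vector(small_graph)
f2 = invariant_feature_vector(large_graph)
# Both vectors have length 4 * 5 / 2 = 10 even though the graphs differ in size.
```

Since the two vectors have the same length and the same ordering of feature-dimension pairs, they can be fed to the same downstream classifier.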

It is straightforward to see that the order of the feature dimensions
of a node does not depend on the graph node ordering, and hence the
order is the same across all graphs. As a result, the elements of the
feature vectors $f_1$ and $f_2$ of any two graphs are always
comparable. More specifically, the covariance output compares both the
norms and the angles between the corresponding pairs of feature
dimension vectors in two graphs.

3.7  GCNN-CapsNet Global Features

Another desired characteristic in the graph classification problem is
to capture the global structure of the graph. For instance, node
degree (as a node feature) is local information and is not very
helpful for solving the graph classification problem. In contrast, a
spectral embedding used as a node feature takes global information
into account and has proven successful as a node vector for problems
dealing with graph semi-supervised learning.

We define a global feature as one that takes the full graph structure
into account during its computation, while local features depend only
on some k-hop node neighbors.

Unfortunately, the basic design of the GCNN model can only capture
local structure information of the graph at each node.

Let $G$ be a graph with graph Laplacian
$L \in \mathbb{R}^{N \times N}$ and node feature matrix
$X \in \mathbb{R}^{N \times d}$. Let $f^{(l)}(X, L)$ be the output
function of the $l$-th GCNN layer equipped with polynomial filters of
degree $k$. Then the output $[f^{(l)}(X, L)]_i$ at the $i$-th node
depends only on the input values of neighbors at most $kl$ hops away.

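This statement can be proved by induction on $l$. It is easy to see
that the base case $l = 1$ holds true. Assume it also holds true for
$f^{(l-1)}(X, L)$, i.e., the $i$-th node output depends on neighbors
up to $k(l-1)$ hops away. Then in

$$f^{(l)}(X, L) = \sigma\big(g(X, L)\big),$$

where $\sigma$ is the layer's nonlinear activation, we focus on the
term

$$g(X, L) = \sum_{j=0}^{k} L^{j} f^{(l-1)}(X, L) W_{j}^{(l)} \qquad (6)$$

and particularly on the last term, involving
$L^{k} f^{(l-1)}(X, L)$. Matrix multiplication of $L^{k}$ with
$f^{(l-1)}(X, L)$ results in the $i$-th node including information
from all nodes that are at most $k$ hops away. But since a node in
$f^{(l-1)}(X, L)$ at a distance of $k$ hops from the $i$-th node can
itself contain information from up to $k(l-1)$ hops, the $i$-th node
contains information from at most $k + k(l-1) = kl$ hops away.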
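The kl-hop bound can also be checked empirically. The sketch below perturbs one input node at a time on a path graph and records which perturbations change the output at node 0. The scalar weights of 0.1, the `tanh` activation and the helper names are assumptions chosen only to keep the experiment small.

```python
import numpy as np

def path_laplacian(n):
    """Laplacian L = D - A of the path graph 0 - 1 - ... - (n-1)."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return np.diag(A.sum(axis=1)) - A

def forward(x, L, k, num_layers, w=0.1):
    """Stack of scalar GCNN layers x <- tanh(sum_{j<=k} w * L^j x)."""
    for _layer in range(num_layers):
        acc = np.zeros_like(x)
        Ljx = x
        for _power in range(k + 1):
            acc = acc + w * Ljx
            Ljx = L @ Ljx
        x = np.tanh(acc)
    return x

def influencing_nodes(n, k, num_layers, target=0):
    """Input nodes whose perturbation changes the output at `target`."""
    L = path_laplacian(n)
    base = forward(np.zeros(n), L, k, num_layers)
    hits = []
    for j in range(n):
        x = np.zeros(n)
        x[j] = 1.0                              # perturb a single input node
        out = forward(x, L, k, num_layers)
        if abs(out[target] - base[target]) > 1e-12:
            hits.append(j)
    return hits

reach = influencing_nodes(n=8, k=2, num_layers=2)
```

With k = 2 and two layers, only nodes 0 through 4 influence node 0, i.e. exactly the nodes within kl = 4 hops on the path.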

A GCNN model with $l$ layers can therefore capture only $kl$-hop
neighborhood structure information at each node. Thus, employing a
GCNN for graph classification with, say, an aggregation layer can
capture only the average variation of $kl$-hop neighborhood
information over the whole graph. To include more global information
about the graph, one can either increase $k$ (i.e., choose a
higher-order graph convolution filter) or $l$ (i.e., the depth of the
GCNN model). Both choices make the model more complex and require more
data samples to reach satisfying results. Among the two, however, we
prefer increasing the depth of the GCNN model, because the first
choice increases the breadth of the GCNN layer and, based on the
current understanding of deep learning theory, increasing depth is
favored over increasing breadth.

For cases where graph node features are missing, as in social network
datasets, it is common practice to take node degree as a node feature.
Such practice can work for problems like graph semi-supervised
learning, where local structure information drives node output labels
(or classes). But in graph classification, global features govern the
output labels, and hence taking node degree is not sufficient. Of
course, we can go for a very deep GCNN model, but that requires higher
sample complexity to achieve satisfying results.

We propose to incorporate FGSD (family of graph spectral distances)
features, computed at each node, in our GCNN-CapsNet model. FGSD
features capture global information about the graph and can also be
computed quickly. Specifically, at each node the FGSD features are
computed as the histogram of the multiset formed by taking the
harmonic distance between that node and all nodes. The harmonic
distance is given by

$$S(x, y) = \sum_{n=0}^{N-1} \frac{1}{\lambda_n}
\big(\phi_n(x) - \phi_n(y)\big)^{2} \qquad (7)$$

where $S(x, y)$ is the harmonic distance, $x, y$ are any graph nodes,
and $\lambda_n$ and $\phi_n(\cdot)$ are the $n$-th eigenvalue and
eigenvector of the graph Laplacian, respectively (a term with
$\lambda_n = 0$ is taken to contribute zero).
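A small NumPy sketch of equation (7) and the per-node histogram is given below. The function names, the number of bins and the histogram range are hypothetical choices, and eigenvalues below a small tolerance are skipped on the assumption that zero eigenvalues do not contribute.

```python
import numpy as np

def harmonic_distances(A, tol=1e-9):
    """Pairwise harmonic distance S(x, y) = sum_n (phi_n(x) - phi_n(y))^2 / lambda_n
    over the nonzero Laplacian eigenvalues. A: (N, N) symmetric adjacency."""
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)       # columns are eigenvectors phi_n
    keep = eigvals > tol                       # skip zero eigenvalue(s)
    scaled = eigvecs[:, keep] / np.sqrt(eigvals[keep])
    diff = scaled[:, None, :] - scaled[None, :, :]
    return (diff ** 2).sum(axis=2)             # (N, N) distance matrix

def fgsd_node_features(A, bins=5, value_range=(0.0, 4.0)):
    """Per-node histogram of harmonic distances to all nodes.
    `bins` and `value_range` are illustrative settings, not from the paper."""
    S = harmonic_distances(A)
    return np.stack([np.histogram(row, bins=bins, range=value_range)[0]
                     for row in S])

# Path graph 0 - 1 - 2 with unit weights.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
S = harmonic_distances(A)
features = fgsd_node_features(A)               # (N, bins) node feature matrix
```

On this path graph the harmonic distance coincides with the effective resistance, so adjacent nodes are at distance 1 and the two end nodes at distance 2. Because every entry of `S` depends on the full spectrum of the Laplacian, each node's histogram reflects the whole graph rather than a k-hop neighborhood.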

IV.  GCNN-CAPSNET  MODEL  CONFIGURATION 

[Figure: model pipeline] Input graph -> graph capsule convolutional
neural network layer [GC(h,p)] -> computation of mean and covariance
[M,C(.)] -> fully connected layer [FC(h)] -> softmax

Here GC(h,p) is a graph capsule CNN layer with h hidden dimensions and
p instantiation parameters. As mentioned earlier, we take the
intermediate tensor, which is subsequently passed through the [M,C(.)]
layer that computes the mean and covariance of its input. The output
of the [M,C(.)] layer is then passed to two fully connected (FC)
layers with h output dimensions, which finally connect to a softmax
layer for computing class probabilities.
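The readout part of this configuration (the [M,C(.)] layer, two FC(h) layers and the softmax) might look roughly like the NumPy sketch below. Concatenating the mean with the upper triangle of the covariance, the ReLU activations, the parameter dictionary layout and the random weights are all assumptions; the paper does not state these details.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                    # shift for numerical stability
    return e / e.sum()

def mean_cov_features(F):
    """[M, C(.)]: mean and covariance of the flattened (N, h*p) capsule output,
    concatenated into one vector (the concatenation is an assumption)."""
    N = F.shape[0]
    F_flat = F.reshape(N, -1)
    mu = F_flat.mean(axis=0)
    centered = F_flat - mu
    C = centered.T @ centered / N
    return np.concatenate([mu, C[np.triu_indices(C.shape[0])]])

def classify(F, params):
    """Two fully connected layers followed by a softmax output layer."""
    z = mean_cov_features(F)
    for W, b in params["fc"]:
        z = np.maximum(0.0, z @ W + b)         # ReLU (assumed activation)
    W_out, b_out = params["out"]
    return softmax(z @ W_out + b_out)

rng = np.random.default_rng(0)
h, p, num_classes = 3, 2, 2
m = h * p
in_dim = m + m * (m + 1) // 2                  # mean + covariance upper triangle
params = {
    "fc": [(rng.normal(size=(in_dim, h)) * 0.1, np.zeros(h)),
           (rng.normal(size=(h, h)) * 0.1, np.zeros(h))],
    "out": (rng.normal(size=(h, num_classes)) * 0.1, np.zeros(num_classes)),
}
F = rng.normal(size=(6, h, p))                 # capsule output for a 6-node graph
probs = classify(F, params)
probs_perm = classify(F[rng.permutation(6)], params)   # relabeled nodes
```

Because node order is discarded in `mean_cov_features`, the class probabilities are the same for any relabeling of the nodes, up to floating-point rounding.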

4.1  Dataset 

To evaluate the GCNN-CapsNet model, we perform the graph
classification task on a variety of benchmark datasets. In the first
round we used the bioinformatics datasets PROTEINS, NCI109, NCI1 and
ENZYMES. In the second round we used the social network datasets
COLLAB, IMDB-BINARY, REDDIT-BINARY and REDDIT-MULTI5K.

4.2  Experimental  setup 

All experiments were performed on a single machine equipped with
2x NVIDIA TITAN VOLTA GPUs and 64 GB RAM. We compare our method with
both deep learning models and graph kernels.


In our experiments, we employ FGSD features only for datasets where
node features are missing, although this strategy can always be used
by concatenating FGSD features with the original node feature values
to capture more global information. Finally, our whole end-to-end
GCNN-CapsNet learning model is guaranteed to produce the same output
for isomorphic graphs.

For deep learning approaches, we adopted three recently proposed
state-of-the-art graph convolutional neural networks: PATCHYSAN
(PSCN) [15], Diffusion CNNs (DCNN) [16] and Dynamic Edge CNN (ECC)
[17].

For graph kernels, we adopted four state-of-the-art graph kernels for
comparison: Random Walk (RW) [18], Shortest Path Kernel (SP) [19],
Graphlet Kernel (GK) [20] and Weisfeiler-Lehman Sub-tree Kernels (WL).

4.3  Graph Classification Results

From Table 1, it is clear that our GCNN-CapsNet model consistently
outperforms most of the considered deep learning methods on the
bioinformatics datasets, with a significant margin of 1% to 6% in
classification accuracy (the upper end on the NCI1 dataset). This
trend continues on the social network datasets, as shown in Table 2.
Here we achieve up to 4% accuracy gain on the COLLAB dataset, and
consistent gains of around 1% on the rest, compared against the other
deep learning approaches.

Our GCNN-CapsNet is also very competitive with state-of-the-art graph
kernel methods. It again shows a consistent performance gain of 1% to
3% accuracy on many bioinformatics datasets when compared against
strong graph kernels, while the other considered deep learning methods
are not close to beating graph kernels on many of these datasets. It
is worth mentioning that most deep learning models are also scalable,
while graph kernels are more fine-tuned towards handling small graphs.

For social network datasets, we have a significant accuracy gain of
4% to 9% (the highest being on the REDDIT-MULTI5K dataset) against
graph kernels, as observed in Table 2. This is expected, as deep
learning methods tend to do better with the large amount of data
available for training on social network datasets. Altogether, our
GCNN-CapsNet model shows very promising results against both the
current state-of-the-art deep learning methods and graph kernels.


Dataset            PROTEINS      NCI109        NCI1          ENZYMES
No. Graphs         1113          4127          4110          600
Max. Graph Size    620           111           111           126
Avg. Graph Size    39.80         29.60         29.80         32.60

Deep Learning Methods
PSCN [2016]        75.00±2.51    -             76.34±1.68    -
DCNN [2016]        61.29±1.60    57.47±1.22    56.61±1.04    42.44±1.76
ECC [2017]         -             75.03         76.82         45.67
GCNN-CapsNet       76.40±4.17    81.12±1.28    82.72±2.38    61.83±5.39

Graph Kernels
RW [2003]          74.22±0.42    >73.00±0.21   >1 Day        24.16±1.64
SP [2005]          75.07±0.54    73.00±0.21    73.00±0.24    40.10±1.50
GK [2009]          71.67±0.55    62.60±0.19    62.28±0.29    26.61±0.99
WL [2011]          74.68±0.49    82.46±0.24    82.19±0.18    52.22±1.26
GCNN-CapsNet       76.40±4.17    81.12±1.28    82.72±2.38    61.83±5.39

Table 1. Classification accuracy on bioinformatics datasets. Result in
bold indicates the best reported classification accuracy. The top half
compares results with various deep learning approaches, while the
bottom half compares results with graph kernels. '>1 Day' indicates
that the computation exceeded 24 hours.


Dataset            COLLAB        IMDB-BINARY   REDDIT-BINARY   REDDIT-MULTI5K
No. Graphs         5000          1000          2000            5000
Max. Graph Size    492           136           3783            3783
Avg. Graph Size    74.49         19.77         429.61          508.5

Deep Learning Methods
PSCN [2016]        72.60±2.15    71.00±2.20    86.30±1.58      49.10±0.70
DCNN [2016]        52.11±0.71    49.06±1.37    OMR             OMR
GCNN-CapsNet       77.71±2.51    71.69±3.40    87.61±2.51      50.10±1.72

Graph Kernels
GK [2009]          72.84±0.28    65.87±0.98    77.34±0.18      41.01±0.17
GCNN-CapsNet       77.71±2.51    71.69±3.40    87.61±2.51      50.10±1.72

Table 2. Classification accuracy on social network datasets. Result in
bold indicates the best reported classification accuracy. The top half
compares results with various deep learning approaches, while the
bottom half compares results with graph kernels. '>1 Day' indicates
that the computation exceeded 24 hours. 'OMR' is an out-of-memory
error.


V.  CONCLUSION 

In this paper, we present a novel graph convolutional neural network
with capsule network (GCNN-CapsNet) model, based on the fundamental
capsule idea, to address some of the basic weaknesses of existing GCNN
models. Our GCNN-CapsNet model design captures more local structure
information than a traditional GCNN and can provide a much richer
representation of individual graph nodes or of the whole graph. For
our purpose we employ a capsule function that preserves statistical
moment information, since moments are fast to compute.

We propose a novel permutation invariant layer based on computing
covariance in our GCNN-CapsNet architecture to deal with the graph
classification problem, which most GCNN models find challenging. This
covariance can again be computed quickly and has been shown to be
better than adopting an aggregation or max pooling layer. We also
propose to equip our GCNN-CapsNet model explicitly with FGSD features
to capture more global information in the absence of node features.
Finally, we show the superior performance of GCNN-CapsNet on many
bioinformatics and social network datasets in comparison with existing
deep learning methods as well as strong graph kernels, setting the
current state of the art.